Analysis on Ford-GoBike Systems¶

by Chukwudi Okereafor¶

Investigation Overview¶

In this investigation, I looked at the characteristics of the bike-share users to find out which type of user takes longer trips, and when. I focused on user_type, member_age, hour_of_day and day_of_week.¶

Dataset Overview¶

This data set contains information on over 170,000 rides taken from one station to another in the San Francisco Bay Area, with over 4,000 bikes used. The attributes include duration (in seconds), user_type, start_time and end_time, as well as additional information such as bike_id, start station (id, name, longitude and latitude), end station (id, name, longitude and latitude), member_birth_year, and member_gender. Rows with missing values were dropped, and three features (hour_of_day, day_of_week and member_age) were created from start_time and member_birth_year.¶

In [1]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import calendar

%matplotlib inline

# suppress warnings from final output
import warnings
warnings.simplefilter("ignore")
In [2]:
# load in the dataset into a pandas dataframe
gobike_df = pd.read_csv('201902-fordgobike-tripdata.csv')

# drop rows with missing values
gobike_df = gobike_df.dropna()
In [3]:
# convert features to datetime dtype
gobike_df['start_time']=pd.to_datetime(gobike_df['start_time'])
gobike_df['end_time']=pd.to_datetime(gobike_df['end_time'])

gobike_df['hour_of_day'] = gobike_df.start_time.dt.hour.astype(int)
gobike_df['day_of_week'] = gobike_df.start_time.dt.strftime('%a')
#gobike_df['month_of_year'] = pd.DatetimeIndex(gobike_df['start_time']).month
#gobike_df['month_of_year'] = gobike_df['month_of_year'].astype(int).apply(lambda x: calendar.month_abbr[x])
# age is computed relative to 2022; note that the trips themselves are from February 2019
gobike_df['member_age'] = 2022 - gobike_df['member_birth_year'].astype(int)
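
The feature derivation above can be checked on a tiny example. This is only a sketch: the two rows in `sample` are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical two-row sample; timestamps and birth years are invented
sample = pd.DataFrame({
    "start_time": ["2019-02-28 17:32:10", "2019-02-02 08:05:00"],
    "member_birth_year": [1984.0, 1995.0],
})
sample["start_time"] = pd.to_datetime(sample["start_time"])

# Same three derived features as in the cell above
sample["hour_of_day"] = sample["start_time"].dt.hour
sample["day_of_week"] = sample["start_time"].dt.strftime("%a")
sample["member_age"] = 2022 - sample["member_birth_year"].astype(int)

print(sample[["hour_of_day", "day_of_week", "member_age"]])
```

Because the reference year is fixed at 2022, every age is three years higher than the rider's age at the time of the February 2019 trips.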

The distribution of trip duration and member age¶

Trip duration takes on a very large range of values, from 60 seconds at the lowest to 84548 seconds at the highest. Member age ranges from 21 to 144 years. After removing outliers from both features and plotting duration on a logarithmic scale, the distribution of ride duration looks roughly normal, with a peak at a little under 550 seconds.¶

In [4]:
# log-spaced bin edges, so the bars have equal width on the log-scaled axis
log_binsize = 0.025
bin_edges = 10 ** np.arange(0, np.log10(gobike_df.duration_sec.max()) + log_binsize, log_binsize)
plt.figure(figsize=[8,6])
plt.hist(data=gobike_df, x='duration_sec', bins=bin_edges)
plt.xscale('log')
# label the ticks with plain numbers instead of powers of ten
ticks = [50, 200, 500, 1500, 3000, 6000]
plt.xticks(ticks, ticks)
plt.xlabel('Duration (seconds)')
plt.xlim([50,6000])
plt.title('Distribution of Trip Duration (seconds)', fontsize=20)
plt.show()
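
The log-spaced bin edges used for the duration histogram can be inspected on their own. The sketch below assumes only the maximum duration of 84548 seconds quoted in the text; the variable names `max_duration`, `exponents` and `ratios` are illustrative.

```python
import numpy as np

log_binsize = 0.025
max_duration = 84548  # highest duration quoted in the text

# Exponents step by log_binsize, so the edges are evenly spaced in log10 terms
exponents = np.arange(0, np.log10(max_duration) + log_binsize, log_binsize)
bin_edges = 10 ** exponents

# Each edge is a constant multiple of the previous one
ratios = bin_edges[1:] / bin_edges[:-1]
print(len(bin_edges), ratios[0])
```

A constant ratio between neighbouring edges is what makes the bars look equally wide once `plt.xscale('log')` is applied.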
In [5]:
binsize=3
bin_edges=np.arange(20, gobike_df.member_age.max()+binsize, binsize)
plt.figure(figsize=[8,6])
plt.hist(data=gobike_df, x='member_age', bins=bin_edges)
plt.xlabel('Age (years)')
plt.xlim([15,80])
plt.title('Age Distribution', fontsize=20)
plt.show()

Proportion of user types on all bike trips taken¶

The majority of bike users are Subscribers (90.7%); Customers (those who do not subscribe) make up the remaining 9.3%.¶

In [6]:
# flag trips longer than 6000 seconds or riders older than 80 years as outliers
outliers = (gobike_df.duration_sec > 6000) | (gobike_df.member_age > 80)
outlier_proportion = (outliers.sum() / gobike_df.shape[0]) * 100

# keep only the rows that are not outliers (~ negates the boolean mask)
gobike_df = gobike_df[~outliers]
In [7]:
# Plot pie chart of user types in %
plt.figure(figsize=[8,6])
explode = (0, 0.1) 
sorted_counts = gobike_df['user_type'].value_counts()
plt.pie(sorted_counts, explode=explode, labels = sorted_counts.index, 
        autopct='%1.1f%%',shadow=True, startangle = 90,counterclock = False)
plt.title('Subscriber vs. Customer (in %)', fontsize=14, fontweight='bold');
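
Percentages like the ones on the pie chart can also be computed directly with `value_counts(normalize=True)`. The sketch below uses a made-up series of ten riders, not the real data, so its numbers differ from the 90.7% and 9.3% reported for the dataset.

```python
import pandas as pd

# Hypothetical sample: 9 subscribers and 1 customer
user_type = pd.Series(["Subscriber"] * 9 + ["Customer"])

# normalize=True returns fractions; scale to percent and round to one decimal
shares = user_type.value_counts(normalize=True).mul(100).round(1)
print(shares)
```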

Bike rides vs hour_of_day and day_of_week¶

Both subscribers and customers take longer trips on weekends than on weekdays. Subscribers take most of their trips around 7-9am (peak at 8am) and 4-6pm (peak at 5pm), which are typical commute hours, so they are probably commuting to work or school.¶

In [8]:
# plotting hour of the day and day of the week together
fig, ax=plt.subplots(nrows=2, figsize=[10,8])
default_color=sns.color_palette()[0]
sns.countplot(data=gobike_df, x='hour_of_day', color=default_color, ax=ax[0])
order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
sns.countplot(data=gobike_df, x='day_of_week', color=default_color, ax=ax[1], order=order)

fig.suptitle('Trips count by hour and day', fontsize=20)
plt.show()
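
The count plots above pool all riders together. One possible way to look at the peak hour for each user type separately is a cross-tabulation. This is a sketch on invented rows: the `trips` frame below is hypothetical and only reuses the column names of the dataset.

```python
import pandas as pd

# Hypothetical trips; only the column names match the real data
trips = pd.DataFrame({
    "hour_of_day": [8, 8, 8, 17, 17, 13, 13, 14],
    "user_type": ["Subscriber", "Subscriber", "Subscriber", "Subscriber",
                  "Subscriber", "Customer", "Customer", "Subscriber"],
})

# Rows are hours, columns are user types, cells are trip counts
hourly = pd.crosstab(trips["hour_of_day"], trips["user_type"])

# Hour with the most trips for each user type
peak_hours = hourly.idxmax()
print(peak_hours)
```

Applied to `gobike_df`, the same two calls would show whether the 8am and 5pm peaks come mainly from subscribers.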

Relationship of duration and age with the days of the week¶

Bike rides on weekends (Sat-Sun) have longer durations than rides on weekdays (Mon-Fri), and on average ‘Customer’ users take longer trips than ‘Subscriber’ users. Also, riders on weekdays (Mon-Fri) tend to be older than riders on weekends (Sat-Sun).¶

In [9]:
sns.boxplot(data=gobike_df, x="day_of_week", y="member_age", showfliers=False, order=order);
plt.title('Age of riders by days of the week', fontsize=15);
In [10]:
sns.boxplot(data=gobike_df, x="day_of_week", y="duration_sec", showfliers=False, order=order);
plt.title('Duration of trips taken by days of the week', fontsize=15);
In [11]:
sns.boxplot(data=gobike_df, x="user_type", y="duration_sec", showfliers=False);
plt.title('Duration of the trips taken by the type of users', fontsize=15);
In [12]:
sns.boxplot(data=gobike_df, x="user_type", y="member_age", showfliers=False);
plt.title('Age of the type of users', fontsize=15);
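
The box plots compare one variable at a time. A grouped median is one way to put numbers on the weekend and user-type comparison together. The sketch below runs on six invented trips; `is_weekend` is a helper column introduced here for illustration and does not exist in the notebook.

```python
import pandas as pd

# Hypothetical trips; durations are invented for illustration
trips = pd.DataFrame({
    "day_of_week": ["Mon", "Tue", "Sat", "Sun", "Wed", "Sat"],
    "user_type": ["Subscriber", "Subscriber", "Subscriber",
                  "Customer", "Customer", "Customer"],
    "duration_sec": [480, 520, 700, 1200, 800, 1000],
})

# Flag weekend trips, then take the median duration per group
trips["is_weekend"] = trips["day_of_week"].isin(["Sat", "Sun"])
medians = trips.groupby(["user_type", "is_weekend"])["duration_sec"].median()
print(medians)
```

The median is used rather than the mean because trip durations are strongly right-skewed, which is also why the histogram above is drawn on a log scale.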

Female riders mostly take their trips around the city of San Jose, California. Riders of unknown gender ride back and forth between San Francisco and Oakland. On the map below, male riders also appear to be overshadowed by female and other-gender riders in both San Francisco and San Jose. Although males make up the largest group of riders, it could be that they ride from a few specific locations in San Francisco and San Jose.¶

In [13]:
# plot the start stations on a map, colored by member gender
fig = px.scatter_mapbox(gobike_df, lat='start_station_latitude', lon='start_station_longitude', 
                        width=800, zoom=4, color='member_gender', 
                        height=600, hover_data=['user_type'],
                       )
fig.update_layout(mapbox_style='open-street-map')

fig.show()
In [14]:
!jupyter nbconvert FORD_GOBIKE_EXPLANATORY_ANALYSIS.ipynb --to slides --post serve --no-input --no-prompt